The concepts outlined by the SQL code below are taken from this LinkedIn learning course, “SQL for Data Analysis” - https://www.linkedin.com/learning/sql-for-data-analysis

knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")


select * from book
Displaying records 1 - 10
book_id title isbn13 language_id num_pages publication_date publisher_id
1 The World’s First Love: Mary Mother of God 8987059752 2 276 1996-09-01 1010
2 The Illuminati 20049130001 1 352 2004-10-04 1967
3 The Servant Leader 23755004321 1 128 2003-03-11 1967
4 What Life Was Like in the Jewel in the Crown: British India AD 1600-1905 34406054602 1 168 1999-09-01 1978
5 Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs 49086007763 1 80 1983-12-29 416
6 Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II 73999140774 1 298 2000-04-01 96
7 William Goldman: Four Screenplays 73999254907 2 504 2000-05-01 95
8 The Season: A Candid Look at Broadway 73999768442 1 448 2004-07-01 1136
9 The Beatles Complete - Updated Edition 73999960822 1 303 1986-12-01 835
10 Working Effectively with Legacy Code 76092025986 1 464 2004-09-01 1591
select * from book_language
Displaying records 1 - 10
language_id language_code language_name
1 eng English
2 en-US United States English
3 fre French
4 spa Spanish
5 en-GB British English
6 mul Multiple Languages
7 grc Greek
8 enm Middle English
9 en-CA Canadian English
10 ger German

1 Working with Dates

Date and Time functions in SQLite

Key date and time functions in SQLite -

  • DATE() - While SQLite doesn’t have a TIMESTAMP function like in MySQL, this function returns the date in ‘YYYY-MM-DD’ format.
SELECT DATE('now') AS "Current Date"; -- returns the current date
1 records
Current Date
2024-07-05
  • DATETIME() - This function returns the date and time in ‘YYYY-MM-DD HH:MM’ format.

SELECT DATETIME('now') AS "Current Date and Time"; -- returns the current date and time
1 records
Current Date and Time
2024-07-05 08:21:00
  • YEAR(): SQLite doesn’t have a YEAR function, but one can use the STRFTIME function to extract the year from a date.
SELECT STRFTIME('%Y', 'now') AS "Current Year"; -- returns the current year
1 records
Current Year
2024
  • TIME(): Returns the time in ‘HH:MM’ format.
SELECT TIME('now') as "Current Time"; -- returns the current time
1 records
Current Time
08:21:00
SELECT CURRENT_DATE as "Current Date"; -- returns the current date
1 records
Current Date
2024-07-05
SELECT CURRENT_TIME as "Current Time"; -- returns the current time
1 records
Current Time
08:21:00
SELECT CURRENT_TIMESTAMP as "Current Time and Date"; -- returns the current date and time
1 records
Current Time and Date
2024-07-05 08:21:00

Manipulating dates in ‘Bookstore’ Tables


SELECT status_date as "Status Date" from order_history;
Displaying records 1 - 10
Status Date
2023-07-19 15:20:03
2023-12-12 20:06:08
2023-12-08 20:19:19
2023-07-12 18:38:30
2024-04-15 04:33:17
2023-12-19 09:43:57
2024-01-06 22:39:56
2023-07-25 00:39:09
2023-09-29 13:29:48
2023-11-14 17:13:49

SELECT customer_id as "Customer ID", order_date as "Order Date" from cust_order WHERE order_date BETWEEN '2023-11-01'
and '2024-05-01' ORDER BY order_date DESC;
Displaying records 1 - 10
Customer ID Order Date
1240 2024-04-30 23:45:50
241 2024-04-30 23:18:12
155 2024-04-30 21:43:09
59 2024-04-30 19:32:57
79 2024-04-30 19:09:28
109 2024-04-30 19:01:49
1367 2024-04-30 18:50:07
21 2024-04-30 17:35:08
45 2024-04-30 17:34:56
383 2024-04-30 16:41:12

SELECT customer_id as “Customer ID”, order_date as “Order Date” from cust_order WHERE order_date < Now();

The above SQL code isn’t possible as there isn’t a Now() function in SQLite.


SELECT customer_id as "Customer ID", order_date as "Order Date" from cust_order WHERE order_date < CURRENT_TIMESTAMP;
Displaying records 1 - 10
Customer ID Order Date
2 2023-07-19 11:32:03
2 2023-12-12 14:06:08
3 2023-12-08 19:55:19
4 2023-07-12 15:26:30
4 2024-04-14 22:57:17
4 2023-12-19 07:43:57
5 2024-01-06 14:03:56
6 2023-07-24 13:27:09
6 2023-09-29 02:29:48
7 2023-11-14 16:13:49

SELECT customer_id as "Customer ID", order_date as "Order Date" from cust_order WHERE order_date > CURRENT_TIMESTAMP;
0 records
Customer ID Order Date

SQLite doesn’t have a YEAR() function but it does have a TIME() function to use.

SELECT customer_id as "Customer ID", order_date as "Order Date" from cust_order WHERE TIME(order_date) > TIME(CURRENT_TIMESTAMP);
Displaying records 1 - 10
Customer ID Order Date
2 2023-07-19 11:32:03
2 2023-12-12 14:06:08
3 2023-12-08 19:55:19
4 2023-07-12 15:26:30
4 2024-04-14 22:57:17
5 2024-01-06 14:03:56
6 2023-07-24 13:27:09
7 2023-11-14 16:13:49
7 2023-12-23 11:15:16
8 2023-07-04 10:31:55

2 - Common SQL String Functions

SUBSTRING()

The function SUBSTRING() returns a part of a character string SUBSTRING(string, start, length)


select title as "Book Title" from book WHERE SUBSTRING(Title,1,2) = 'Ro'
Displaying records 1 - 10
Book Title
Roverandom
Rough Crossings: Britain the Slaves and the American Revolution
Romanticism
Ronia the Robber’s Daughter
Robinson Crusoe
Rock Star Superstar
Rounding the Mark (Inspector Montalbano #7)
Roald Dahl: A Biography
Rose of No Man’s Land
Robinson Crusoe

CONCAT()

The CONCAT(string1, string2,….,string_n) function joins two or more strings together


select CONCAT(first_name,' ',last_name) AS "Full Name" from customer
Displaying records 1 - 10
Full Name
Ursola Purdy
Ruthanne Vatini
Reidar Turbitt
Rich Kirsz
Carline Kupis
Kandy Adamec
Jermain Giraudeau
Nolly Bonicelli
Phebe Curdell
Euell Guilder

Double pipes, || , can also be used in place of CONCAT(), in SQLite.


SELECT first_name || ' ' || last_name AS "Full Name"
FROM customer;
Displaying records 1 - 10
Full Name
Ursola Purdy
Ruthanne Vatini
Reidar Turbitt
Rich Kirsz
Carline Kupis
Kandy Adamec
Jermain Giraudeau
Nolly Bonicelli
Phebe Curdell
Euell Guilder

UPPER() and LOWER()

The UPPER() and LOWER() functions will return strings as upper case or lower case, respectively.

select UPPER(CONCAT(first_name,' ',last_name)) AS "Full Name" from customer
Displaying records 1 - 10
Full Name
URSOLA PURDY
RUTHANNE VATINI
REIDAR TURBITT
RICH KIRSZ
CARLINE KUPIS
KANDY ADAMEC
JERMAIN GIRAUDEAU
NOLLY BONICELLI
PHEBE CURDELL
EUELL GUILDER
select LOWER(CONCAT(first_name,' ',last_name)) AS "Full Name" from customer
Displaying records 1 - 10
Full Name
ursola purdy
ruthanne vatini
reidar turbitt
rich kirsz
carline kupis
kandy adamec
jermain giraudeau
nolly bonicelli
phebe curdell
euell guilder

REPLACE()

The REPLACE() function replaces a substring with another substring - REPLACE(string, old_string, new_string)

select REPLACE(language_id, '1', "English") as Language from book where language_id = '1'
Displaying records 1 - 10
Language
English.0
English.0
English.0
English.0
English.0
English.0
English.0
English.0
English.0
English.0

Using TRIM, LTRIM, and RTRIM in SQLite


select title as "Title" FROM book
Displaying records 1 - 10
Title
The World’s First Love: Mary Mother of God
The Illuminati
The Servant Leader
What Life Was Like in the Jewel in the Crown: British India AD 1600-1905
Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs
Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II
William Goldman: Four Screenplays
The Season: A Candid Look at Broadway
The Beatles Complete - Updated Edition
Working Effectively with Legacy Code

TRIM: Removes leading and trailing characters (default is space).


select TRIM(title)  as "Trimmed Title" FROM book
Displaying records 1 - 10
Trimmed Title
The World’s First Love: Mary Mother of God
The Illuminati
The Servant Leader
What Life Was Like in the Jewel in the Crown: British India AD 1600-1905
Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs
Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II
William Goldman: Four Screenplays
The Season: A Candid Look at Broadway
The Beatles Complete - Updated Edition
Working Effectively with Legacy Code

TRIM with specific characters: Removes specified characters from both ends.


SELECT TRIM(title, 'The') AS "Trimmed Title" FROM book;
Displaying records 1 - 10
Trimmed Title
World’s First Love: Mary Mother of God
Illuminati
Servant Leader
What Life Was Like in the Jewel in the Crown: British India AD 1600-1905
Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs
Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II
William Goldman: Four Screenplays
Season: A Candid Look at Broadway
Beatles Complete - Updated Edition
Working Effectively with Legacy Cod

LTRIM with specific characters: Removes specified leading characters.

SELECT LTRIM(title, 'What') AS "Trimmed Title" FROM book;
Displaying records 1 - 10
Trimmed Title
The World’s First Love: Mary Mother of God
The Illuminati
The Servant Leader
Life Was Like in the Jewel in the Crown: British India AD 1600-1905
Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs
Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II
illiam Goldman: Four Screenplays
The Season: A Candid Look at Broadway
The Beatles Complete - Updated Edition
orking Effectively with Legacy Code
SELECT LTRIM(title, 'The') AS "Trimmed Title" FROM book;
Displaying records 1 - 10
Trimmed Title
World’s First Love: Mary Mother of God
Illuminati
Servant Leader
What Life Was Like in the Jewel in the Crown: British India AD 1600-1905
Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs
Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II
William Goldman: Four Screenplays
Season: A Candid Look at Broadway
Beatles Complete - Updated Edition
Working Effectively with Legacy Code

RTRIM with specific characters: Removes specified trailing characters.

SELECT RTRIM(title, 's') AS "Trimmed Title" FROM book;
Displaying records 1 - 10
Trimmed Title
The World’s First Love: Mary Mother of God
The Illuminati
The Servant Leader
What Life Was Like in the Jewel in the Crown: British India AD 1600-1905
Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frog
Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II
William Goldman: Four Screenplay
The Season: A Candid Look at Broadway
The Beatles Complete - Updated Edition
Working Effectively with Legacy Code

LTRIM and RTRIM can be used in combination with one another -

SELECT RTRIM(LTRIM(title, 'The'), 's') AS "Trimmed Title" FROM book;
Displaying records 1 - 10
Trimmed Title
World’s First Love: Mary Mother of God
Illuminati
Servant Leader
What Life Was Like in the Jewel in the Crown: British India AD 1600-1905
Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frog
Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II
William Goldman: Four Screenplay
Season: A Candid Look at Broadway
Beatles Complete - Updated Edition
Working Effectively with Legacy Code

Combining different String Functions


select UPPER(CONCAT(street_number,' ',street_name,' ',city)) as "Address" from address LIMIT 6
6 records
Address
57.0 GLACIER HILL AVENUE TORBAT-E JāM
86.0 DOTTIE JUNCTION BEAUMONT
292.0 RAMSEY AVENUE CAYAMBE
5618.0 THACKERAY JUNCTION CALDAS
4.0 2ND PARK NGUNGURU
387.0 NANCY JUNCTION BURIRAO

Create Read Update Delete (CRUD) Functions

select * from Publisher
Displaying records 1 - 10
publisher_id publisher_name
1 10/18
2 1st Book Library
3 1st World Library
4 A & C Black (Childrens books)
5 A Harvest Book/Harcourt Inc.
6 A K PETERS
7 AA World Services
8 Abacus
9 Abacus Books
10 Abacus Books (London)

INSERT INTO Publisher (publisher_id, publisher_name) VALUES ('1001', 'UoG Press')

select * from Publisher WHERE publisher_name == "UoG Press"
1 records
publisher_id publisher_name
1001 UoG Press

UPDATE Publisher SET publisher_id = '1002' WHERE publisher_name == "UoG Press"

select * from Publisher WHERE publisher_name == "UoG Press"
1 records
publisher_id publisher_name
1002 UoG Press

DELETE FROM Publisher WHERE publisher_id = '1002'

select * from Publisher WHERE publisher_name == "UoG Press"
0 records
publisher_id publisher_name

3 - Visualising Data

Python Method from the LinkedIn Learning Course

import pandas as pd

import sqlalchemy as sa

import plotly.express as px

import plotly.graph_objects as go

from sqlalchemy.engine import create_engine
# 
# def run_query(query):
#     return pd.read_sql(query, con=connection)
# 
# df = run_query("SELECT * FROM cust_order",con=connection )

Bar Charts

#creating a new column to mark each order as a 1#
cust_order <- cust_order %>%
  mutate(number_orders = 1)

#converting the order date column into a date format#
cust_order$order_date <- as.Date(cust_order$order_date)

cust_order$order_date <- ymd(cust_order$order_date)

#simply bar chart of the no.orders, put into an object to be run through plotly#
p <- ggplot(cust_order, aes(x = order_date, y = number_orders)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    theme_minimal() +
    labs(title = " Simple Bar Chart", x = "Order Date", y = "No.Orders") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))


#display the bar chart#
p

We can use the ggplotly from the plotly package to make the plot interactive.

p <- ggplotly(p)

p
cust_order_monthly  <- cust_order %>% 
  select(order_date, number_orders) %>%
  group_by(month = lubridate::floor_date(`order_date`, 'month'))  %>%
  summarise("no_orders_monthly" = sum(number_orders))


p <- ggplot(cust_order_monthly , aes(x = month, y = no_orders_monthly)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    theme_minimal() +
    labs(title = "Monthly orders from the 'Cust_order' Table", x = "Order Date", y = "No.Orders") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))

p

Making the plot interactive -

p <- ggplotly(p)

p

Dygraphs

https://rstudio.github.io/dygraphs/ - An extremely useful way of visualising longitudinal data in an interactive, highly-customisable manner.

don <- xts(x = cust_order_monthly$no_orders_monthly, order.by = cust_order_monthly$month)

p <- dygraph(don,xlab = "Order Date", ylab = "No.Orders",main = "Monthly No.Orders from 'cust_order' Table ") %>%
  dySeries("V1", label = "No.Orders") %>%
  dyRangeSelector() %>%
  dyUnzoom() 
 


p
#Create 'Unzoom' function#

dyUnzoom <-function(dygraph) {
  dyPlugin(
    dygraph = dygraph,
    name = "Unzoom",
    path = system.file("plugins/unzoom.js", package = "dygraphs")
  )
}

cust_order_daily  <- cust_order %>% 
  select(order_date, number_orders) %>%
  group_by(day = lubridate::floor_date(`order_date`, 'day'))  %>%
  summarise("no_orders_daily" = sum(number_orders))

cust_order_daily$day <- as.POSIXct(cust_order_daily$day)


don <- xts(x = cust_order_daily$no_orders_daily, order.by = cust_order_daily$day)

p <- dygraph(don,xlab = "Order Date", ylab = "No.Orders",main = "Daily No.Orders from 'cust_order' Table ") %>%
  dySeries("V1", label = "No.Orders") %>%
  dyRangeSelector() %>%
  dyUnzoom() 
 


p

The time series can be ‘filled’ using the dyoptions() argument ‘fillgraph’ (TRUE/FALSE).

p <- dygraph(don,xlab = "Order Date", ylab = "No.Orders",main = "Daily No.Orders from 'cust_order' Table ") %>%
  dySeries("V1", label = "No.Orders") %>%
    dyOptions(fillGraph = TRUE, fillAlpha = 0.1) %>%
  dyRangeSelector() %>%
  dyUnzoom() 
 

p

The function dyEvent() allows vertical lines with a label to be added. dyAnnotation() generates annotations directly on the time series line. Using a custom function, a larger annotation and attached to the x-axis is also possible, presAnnotation() . dyshading() allows whole areas of the graph to be shaded -

presAnnotation <- function(dygraph, x, text) {
  dygraph %>%
    dyAnnotation(x, text, attachAtBottom = TRUE, width = 60)
}


p <- dygraph(don,xlab = "Order Date", ylab = "No.Orders",main = "Daily No.Orders from 'cust_order' Table ") %>%
  dySeries("V1", label = "No.Orders") %>%
    dyOptions(fillGraph = TRUE, fillAlpha = 0.1) %>%
  dyRangeSelector() %>%
  dyEvent("2023-08-24", "Andy's Order", labelLoc = "bottom") %>%
  dyEvent("2023-10-06", "Bob's Order", labelLoc = "bottom") %>%
  dyAnnotation("2023-09-13", text = "Sep.", tooltip = "September 241 Offer") %>%
  dyAnnotation("2023-11-13", text = "Nov.",tooltip = "November Multibuy Offer" ) %>%
  presAnnotation("2024-05-29", text = "BHol" ) %>%
  dyShading(from = "2024-04-01", to = "2024-05-01", color = "#CCEBD6") %>%
  dyUnzoom() 
 

p

The function dyLimit() allows line limits to be added.

p <- dygraph(don,xlab = "Order Date", ylab = "No.Orders",main = "Daily No.Orders from 'cust_order' Table ") %>%
  dySeries("V1", label = "No.Orders") %>%
  dyOptions(fillGraph = TRUE, fillAlpha = 0.1) %>%
  dyRangeSelector() %>%
  dyLimit(8, color = "blue") %>%
  dyLimit(40, color = "blue") %>%
  dyUnzoom() 
 

p

4 - Indexes & Views

This section and the following take inspiration but not much content, from the LinkedIn learning course “Intermediate SQL for Data Scientists” - https://www.linkedin.com/learning/intermediate-sql-for-data-scientists

knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")

Indexes

Indexes in SQLite are structures designed to improve the speed of data retrieval. They’re similar to indexes in a book, allowing for faster access to rows in a table based on the values of one or more columns.

Types of Indexes -

  1. Single-Column Index - Created on a single column.

  2. Multi-Column Index - Created on two or more columns.

  3. Unique Index - Ensures that all values in the indexed column(s) are unique.

  4. Primary Key Index - Automatically created when a table has a primary key constraint.

  5. Automatic Indexes - Sometimes SQLite creates indexes automatically to optimize query performance, particularly for joins.


CREATE INDEX Customer_IDs ON cust_order(customer_id); 

CREATE INDEX Customer_Orders ON cust_order(order_id, customer_id);

CREATE UNIQUE INDEX unique_countries ON country(country_name);

This SQL code above fails - A unique index does not discard non-unique values or automatically filter out duplicates. Instead, it enforces a constraint that prevents duplicates from being inserted into the table in the first place.


CREATE UNIQUE INDEX unique_status_ids ON order_status(status_id);

List all indexes associated with a table -

PRAGMA index_list(cust_order);
2 records
seq name unique origin partial
0 Customer_Orders 0 c 0
1 Customer_IDs 0 c 0

Get information about a specific index -


PRAGMA index_info(Customer_Orders);
2 records
seqno cid name
0 0 order_id
1 2 customer_id

Indexes can be dropped using the DROP INDEX statement -


DROP INDEX Customer_Orders;

Checking the index deletion has worked -

PRAGMA index_list(cust_order);
1 records
seq name unique origin partial
0 Customer_IDs 0 c 0

An example with a more complex and intricate query -


select * from book
Displaying records 1 - 10
book_id title isbn13 language_id num_pages publication_date publisher_id
1 The World’s First Love: Mary Mother of God 8987059752 2 276 1996-09-01 1010
2 The Illuminati 20049130001 1 352 2004-10-04 1967
3 The Servant Leader 23755004321 1 128 2003-03-11 1967
4 What Life Was Like in the Jewel in the Crown: British India AD 1600-1905 34406054602 1 168 1999-09-01 1978
5 Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs 49086007763 1 80 1983-12-29 416
6 Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II 73999140774 1 298 2000-04-01 96
7 William Goldman: Four Screenplays 73999254907 2 504 2000-05-01 95
8 The Season: A Candid Look at Broadway 73999768442 1 448 2004-07-01 1136
9 The Beatles Complete - Updated Edition 73999960822 1 303 1986-12-01 835
10 Working Effectively with Legacy Code 76092025986 1 464 2004-09-01 1591

select author_name from author where author_name == "Walter Scott"
1 records
author_name
Walter Scott

Creating an index of author names.


CREATE INDEX Author_names ON author(author_name);

Now our ‘sever’ should use that index to optimise the performance of this query.


SELECT
    b.title,
    b.isbn13,
    b.num_pages,
    b.publication_date,
    a.author_name,
    ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('J.K. Rowling', 'Bill Bryson')
  AND strftime('%Y', b.publication_date) IN ('1997','1998', '1999','2000')
ORDER BY ol.price DESC;
8 records
title isbn13 num_pages publication_date author_name price
Harry Potter and the Goblet of Fire (Harry Potter #4) 9.780748e+12 636 2000-07-08 J.K. Rowling 15.04
Bill Bryson: The Complete Notes 9.780386e+12 544 2000-10-05 Bill Bryson 13.73
The Lost Continent: Travels in Small-town America 9.780553e+12 379 1999-01-02 Bill Bryson 12.07
Harry Potter and the Sorcerer’s Stone (Harry Potter #1) 9.780786e+12 424 1999-11-12 J.K. Rowling 9.99
The Lost Continent: Travels in Small-town America 9.780553e+12 379 1999-01-02 Bill Bryson 9.60
I’m a Stranger Here Myself: Notes on Returning to America After Twenty Years Away 9.780768e+12 304 2000-06-28 Bill Bryson 9.40
Bill Bryson: The Complete Notes 9.780386e+12 544 2000-10-05 Bill Bryson 2.71
Notes from a Small Island 9.780381e+12 324 1997-05-28 Bill Bryson 1.87

Views

Views in SQLite are virtual tables that provide a way to represent the results of a query as a table.

Reasons to use Views -

  1. Simplify Complex Queries - By encapsulating complex joins and calculations within a view, queries can be simpler and more understandable.

  2. Enhance Security - Views can restrict access to specific data by exposing only certain columns or rows to users.

  3. Provide Abstraction - Offer a layer of abstraction, allowing changes in the underlying database schema without affecting the end users.

  4. Improve Maintainability - Views centralise query logic, making the system easier to maintain and modify.

CREATE VIEW Bryson_Books AS
SELECT
    b.title,
    b.isbn13,
    b.num_pages,
    b.publication_date,
    a.author_name,
    ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('Bill Bryson');
CREATE VIEW Rowling_Books AS
SELECT
    b.title,
    b.isbn13,
    b.num_pages,
    b.publication_date,
    a.author_name,
    ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('J.K. Rowling');
CREATE VIEW Walter_Scott_Books AS
SELECT
    b.title,
    b.isbn13,
    b.num_pages,
    b.publication_date,
    a.author_name,
    ol.price
FROM book b
INNER JOIN book_author ba ON b.book_id = ba.book_id
INNER JOIN author a ON ba.author_id = a.author_id
INNER JOIN order_line ol ON b.book_id = ol.book_id
INNER JOIN cust_order co ON ol.order_id = co.order_id
WHERE a.author_name IN ('Walter Scott');
SELECT * FROM Bryson_Books;
Displaying records 1 - 10
title isbn13 num_pages publication_date author_name price
Bizarre World 9.780752e+12 120 2001-05-01 Bill Bryson 17.32
The Lost Continent: Travels in Small Town America 9.780061e+12 299 1990-08-28 Bill Bryson 14.36
Neither Here nor There: Travels in Europe 9.780381e+12 254 1993-03-28 Bill Bryson 12.47
Made in America: An Informal History of the English Language in the United States 9.780381e+12 364 2001-10-23 Bill Bryson 9.28
Notes from a Small Island 9.780381e+12 324 1997-05-28 Bill Bryson 1.87
Bill Bryson: The Complete Notes 9.780386e+12 544 2000-10-05 Bill Bryson 2.71
Bill Bryson: The Complete Notes 9.780386e+12 544 2000-10-05 Bill Bryson 13.73
A Short History of Nearly Everything (Illustrated Edition) 9.780386e+12 624 2010-10-05 Bill Bryson 5.64
A Short History of Nearly Everything (Illustrated Edition) 9.780386e+12 624 2010-10-05 Bill Bryson 19.05
Down Under 9.780553e+12 398 2001-08-06 Bill Bryson 10.92
SELECT * FROM Walter_Scott_Books;
6 records
title isbn13 num_pages publication_date author_name price
The Antiquary 9.780193e+12 528 2002-05-23 Walter Scott 10.66
The Antiquary 9.780193e+12 528 2002-05-23 Walter Scott 15.75
The Antiquary 9.780193e+12 528 2002-05-23 Walter Scott 0.84
The Antiquary 9.780193e+12 528 2002-05-23 Walter Scott 13.99
The Castle of Otranto 9.780193e+12 125 1998-07-16 Walter Scott 3.56
Waverley 9.780193e+12 463 1998-08-20 Walter Scott 1.96

Views themselves are not directly updatable, but they can be dropped and recreated -

DROP VIEW IF EXISTS Bryson_Books;
SELECT name FROM sqlite_master WHERE type='view';
2 records
name
Rowling_Books
Walter_Scott_Books
SELECT sql FROM sqlite_master WHERE type='view' AND name='Walter_Scott_Books';
1 records
sql
CREATE VIEW Walter_Scott_Books AS

SELECT b.title, b.isbn13, b.num_pages, b.publication_date, a.author_name, ol.price FROM book b INNER JOIN book_author ba ON b.book_id = ba.book_id INNER JOIN author a ON ba.author_id = a.author_id INNER JOIN order_line ol ON b.book_id = ol.book_id INNER JOIN cust_order co ON ol.order_id = co.order_id WHERE a.author_name IN (‘Walter Scott’) |

5 - Statistical aggregate functions

SUM(), AVG(), ROUND()

The SUM() function is an aggregate function that calculates the total sum of a numeric column. The function is commonly used in conjunction with the GROUP BY clause to calculate sums for specific groups of data.

knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")

SELECT SUM(street_name) as "No.different Street Names", SUM(city) as "No.different Cities"  FROM address;
1 records
No.different Street Names No.different Cities
87 0

As street_name and city are text columns, sum won’t work properly on them. It’s best to use COUNT(DISTINCT()).

SELECT COUNT(DISTINCT street_name) as "No. of Different Street Names", 
       COUNT(DISTINCT city) as "No. of Different Cities" 
FROM address;
1 records
No. of Different Street Names No. of Different Cities
958 974
knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")


SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Order per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Full Name
112.44 Abbot Tesseyman
31.84 Abbott Kendrew
119.68 Abby Chevins
59.10 Abel Trower
88.54 Abigael Trowbridge
177.89 Abraham Skudder
104.84 Adah Cotty
143.15 Addie Basterfield
47.75 Addison Sigg
161.51 Adrian Kunzelmann

The AVG() function in SQL is an aggregate function that calculates the average value of a numeric column. It sums up all the values in the column and divides by the number of non-null values, providing the mean value.


SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    AVG(ol.price) AS "Average Order Price per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Order per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Average Order Price per Customer Full Name
112.44 11.244000 Abbot Tesseyman
31.84 10.613333 Abbott Kendrew
119.68 13.297778 Abby Chevins
59.10 7.387500 Abel Trower
88.54 8.854000 Abigael Trowbridge
177.89 9.362632 Abraham Skudder
104.84 7.488571 Adah Cotty
143.15 10.225000 Addie Basterfield
47.75 9.550000 Addison Sigg
161.51 9.500588 Adrian Kunzelmann

The ROUND() function in SQL is used to round a numeric value to a specified number of decimal places. It takes two arguments: the number to be rounded and the number of decimal places to round to.


SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    ROUND(AVG(ol.price),2) AS "Average Order Price per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Order per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Average Order Price per Customer Full Name
112.44 11.24 Abbot Tesseyman
31.84 10.61 Abbott Kendrew
119.68 13.30 Abby Chevins
59.10 7.39 Abel Trower
88.54 8.85 Abigael Trowbridge
177.89 9.36 Abraham Skudder
104.84 7.49 Adah Cotty
143.15 10.22 Addie Basterfield
47.75 9.55 Addison Sigg
161.51 9.50 Adrian Kunzelmann

Variance

The VARIANCE() function is used to calculate the statistical variance of a set of numeric values, which measures the dispersion of the values from their mean. However, SQLite does not have a built-in VARIANCE() function.

To calculate variance in SQLite, one can use a combination of SQL functions.

  • Calculating Sample Variance -

The sample variance estimates the variance from a sample of the population. The formula is -

\[ \sigma\^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]

  • (\[\sigma\^2\]) is the population variance

  • (\[x_i\]) represents each data point

  • (\[\bar{x}\]) is the sample mean

  • (\[n - 1\]) is the total number of data points in the population

Used when there’s only a sample and need the population variance needs to be estimated. The denominator is \[n - 1\] . \[n - 1\] instead of \[n\] to correct the bias in the estimation of the population variance from a sample (Bessel’s correction).

The SQL code -

SELECT SUM((value - avg_value) * (value - avg_value)) / (COUNT(*) - 1) AS sample_variance FROM ( SELECT value, AVG(value) AS avg_value FROM table_name ) AS subquery;


WITH OrderStats AS (
    SELECT 
        c.customer_id,
        SUM(ol.price) AS "Sum of Orders per Customer",
        AVG(ol.price) AS "Average Order Price per Customer",
        COUNT(ol.price) AS "Order Count",
        CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
    FROM 
        order_line ol
    INNER JOIN 
        cust_order co ON ol.order_id = co.order_id
    INNER JOIN 
        customer c ON co.customer_id = c.customer_id
    GROUP BY 
        c.customer_id, c.first_name, c.last_name
)
SELECT
    os."Sum of Orders per Customer",
    ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
    os."Full Name",
    ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / (os."Order Count" - 1), 2) AS "Order Price Sample Variance"
FROM
    order_line ol
INNER JOIN
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    customer c ON co.customer_id = c.customer_id
INNER JOIN
    OrderStats os ON c.customer_id = os.customer_id
GROUP BY
    os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
    "Sum of Orders per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Average Order Price per Customer Full Name Order Price Sample Variance
711.41 10.94 Rich Kirsz 32.82
663.41 10.21 Farand Tremmil 34.46
635.58 10.42 Renado Sherrington 34.50
622.31 12.20 Lynda Roseborough 26.15
611.59 10.54 Mick Sever 38.09
596.18 11.25 Alysa Crombleholme 29.67
583.81 9.73 Emylee Hubbert 37.33
574.20 10.07 La verne Figg 30.94
561.26 9.85 Zora Hurles 42.98
549.63 10.78 Penny Bovingdon 37.10
  • Calculating Population Variance -

The population variance measures the dispersion of all data points in a population from the population mean ((\[\mu\])). The formula is -

\[ \sigma\^2 = \frac{\sum (x_i - \mu)^2}{N} \]

  • (\[\sigma\^2\]) is the population variance

  • (\[x_i\]) represents each data point

  • (\[\mu\]) is the mean of the population

  • (\[n\]) is the total number of data points in the population

With large datasets, the difference between sample and population variance is minimal since the correction factor (n - 1) vs. (n) has a smaller impact as the sample size increases.

The SQL code -

SELECT AVG((value - avg_value) * (value - avg_value)) AS population_variance FROM ( SELECT value, AVG(value) AS avg_value FROM table_name ) AS subquery;


WITH OrderStats AS (
    SELECT 
        c.customer_id,
        SUM(ol.price) AS "Sum of Orders per Customer",
        AVG(ol.price) AS "Average Order Price per Customer",
        COUNT(ol.price) AS "Order Count",
        CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
    FROM 
        order_line ol
    INNER JOIN 
        cust_order co ON ol.order_id = co.order_id
    INNER JOIN 
        customer c ON co.customer_id = c.customer_id
    GROUP BY 
        c.customer_id, c.first_name, c.last_name
)
SELECT
    os."Sum of Orders per Customer",
    ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
    os."Full Name",
    ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / os."Order Count", 2) AS "Order Price Population Variance"
FROM
    order_line ol
INNER JOIN
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    customer c ON co.customer_id = c.customer_id
INNER JOIN
    OrderStats os ON c.customer_id = os.customer_id
GROUP BY
    os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
    "Sum of Orders per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Average Order Price per Customer Full Name Order Price Population Variance
711.41 10.94 Rich Kirsz 32.31
663.41 10.21 Farand Tremmil 33.93
635.58 10.42 Renado Sherrington 33.94
622.31 12.20 Lynda Roseborough 25.64
611.59 10.54 Mick Sever 37.44
596.18 11.25 Alysa Crombleholme 29.11
583.81 9.73 Emylee Hubbert 36.71
574.20 10.07 La verne Figg 30.40
561.26 9.85 Zora Hurles 42.23
549.63 10.78 Penny Bovingdon 36.37

Standard Deviation

Standard deviation measures the dispersion of a dataset relative to its mean, indicating how spread out the data points are. A low standard deviation means the data points are close to the mean, while a high standard deviation indicates they are more spread out.

SQLite does not have a built-in STDDEV() or STDEV() function for calculating standard deviation.

These are the steps to calculate standard deviation in SQLite:

  1. Calculate the mean of the dataset.

  2. Compute Squared Differences calculate the squared difference of each value from the mean.

  3. Aggregate and Calculate - Sum up the squared differences, divide by the count of values (for population standard deviation) or by (n - 1) (for sample standard deviation), and take the square root of the result.

The population standard deviation (()) is calculated using the formula:

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \]

where: - () is the population standard deviation, - (x_i) represents each data point, - () is the mean of the population, - (N) is the total number of data points in the population.

The SQL code -

SELECT SQRT(AVG((value - avg_value) * (value - avg_value))) AS population_stddev FROM ( SELECT value, AVG(value) OVER () AS avg_value FROM table_name ) AS subquery;

WITH OrderStats AS (
    SELECT 
        c.customer_id,
        SUM(ol.price) AS "Sum of Orders per Customer",
        AVG(ol.price) AS "Average Order Price per Customer",
        COUNT(ol.price) AS "Order Count",
        CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
    FROM 
        order_line ol
    INNER JOIN 
        cust_order co ON ol.order_id = co.order_id
    INNER JOIN 
        customer c ON co.customer_id = c.customer_id
    GROUP BY 
        c.customer_id, c.first_name, c.last_name
)
SELECT
    os."Sum of Orders per Customer",
    ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
    os."Full Name",
    ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / os."Order Count", 2) AS "Order Price Population Variance",
    ROUND(SQRT(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / os."Order Count"), 2) AS "Order Price Standard Deviation"
FROM
    order_line ol
INNER JOIN
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    customer c ON co.customer_id = c.customer_id
INNER JOIN
    OrderStats os ON c.customer_id = os.customer_id
GROUP BY
    os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
    "Sum of Orders per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Average Order Price per Customer Full Name Order Price Population Variance Order Price Standard Deviation
711.41 10.94 Rich Kirsz 32.31 5.68
663.41 10.21 Farand Tremmil 33.93 5.82
635.58 10.42 Renado Sherrington 33.94 5.83
622.31 12.20 Lynda Roseborough 25.64 5.06
611.59 10.54 Mick Sever 37.44 6.12
596.18 11.25 Alysa Crombleholme 29.11 5.39
583.81 9.73 Emylee Hubbert 36.71 6.06
574.20 10.07 La verne Figg 30.40 5.51
561.26 9.85 Zora Hurles 42.23 6.50
549.63 10.78 Penny Bovingdon 36.37 6.03

The sample standard deviation ((s)) is calculated using the formula:

\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \]

where: - (s) is the sample standard deviation, - (x_i) represents each data point in the sample, - ({x}) is the sample mean, - (n) is the number of data points in the sample.

The SQL code -

SELECT SQRT(SUM((value - avg_value) * (value - avg_value)) / (COUNT(*) - 1)) AS sample_stddev FROM ( SELECT value, AVG(value) OVER () AS avg_value FROM table_name ) AS subquery;


WITH SampledCustomers AS (
    SELECT 
        c.customer_id,
        c.first_name,
        c.last_name
    FROM 
        customer c
    ORDER BY 
        RANDOM()
    LIMIT 50
),
OrderStats AS (
    SELECT 
        c.customer_id,
        SUM(ol.price) AS "Sum of Orders per Customer",
        AVG(ol.price) AS "Average Order Price per Customer",
        COUNT(ol.price) AS "Order Count",
        CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
    FROM 
        order_line ol
    INNER JOIN 
        cust_order co ON ol.order_id = co.order_id
    INNER JOIN 
        SampledCustomers c ON co.customer_id = c.customer_id
    GROUP BY 
        c.customer_id, c.first_name, c.last_name
)
SELECT
    os."Sum of Orders per Customer",
    ROUND(os."Average Order Price per Customer", 2) AS "Average Order Price per Customer",
    os."Full Name",
    ROUND(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / (os."Order Count" - 1), 2) AS "Order Price Sample Variance",
    ROUND(SQRT(SUM((ol.price - os."Average Order Price per Customer") * (ol.price - os."Average Order Price per Customer")) / (os."Order Count" - 1)), 2) AS "Order Price Sample Standard Deviation"
FROM
    order_line ol
INNER JOIN
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    customer c ON co.customer_id = c.customer_id
INNER JOIN
    OrderStats os ON c.customer_id = os.customer_id
GROUP BY
    os.customer_id, os."Sum of Orders per Customer", os."Average Order Price per Customer", os."Full Name"
ORDER BY
    "Sum of Orders per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Average Order Price per Customer Full Name Order Price Sample Variance Order Price Sample Standard Deviation
516.90 11.24 Enrichetta Morrill 35.84 5.99
383.98 10.67 Chrissy Notton 32.56 5.71
337.77 10.56 Lianne Gorry 32.20 5.67
316.29 9.88 Bethina Cady 30.98 5.57
302.11 9.15 Lucas Wyldbore 39.15 6.26
288.05 8.73 Rustin Cadden 39.63 6.30
259.36 8.94 Rodd Diplock 23.67 4.87
254.38 11.56 Morris Morales 42.79 6.54
180.90 8.61 Fania Jeanesson 27.50 5.24
170.70 10.67 Danice Matthiesen 37.23 6.10

6 - WITH() function

WITH(), also known as Common Table Expressions (CTEs), allows for improved readability and reusability of SQL queries. It’s particularly useful for breaking down complex queries into simpler, more manageable parts by creating temporary result sets that can be referenced within the main query.

WITH cte_name AS ( – CTE Query SELECT … ) SELECT … FROM cte_name

Key Differences

  1. Structure and Readability -

Subqueries - Can be less readable, especially when nested.

WITH() - Provide a clearer, more structured approach by defining temporary tables with meaningful names.

  1. Reusability -

Subqueries - Generally not reusable; you need to repeat the subquery if it’s used in multiple places.

WITH() - Reusable within the main query, reducing redundancy and improving maintainability.

  1. Debugging and Maintenance -

Subqueries - Harder to debug and maintain due to their nested nature.

WITH() - Easier to debug and maintain due to their clear, modular structure.

When using WITH()

WITH CustomerOrders AS (
    SELECT 
        co.customer_id, 
        co.order_date,
        ol.order_id,
        SUM(ol.price) AS "Total Order Amount"
    FROM 
        cust_order co
    INNER JOIN 
        order_line ol ON co.order_id = ol.order_id
    WHERE 
        co.order_date BETWEEN '2023-11-01' AND '2024-05-01'
    GROUP BY 
        co.customer_id, co.order_date, ol.order_id
)
SELECT
    c.customer_id AS "Customer ID",
    c.order_date AS "Order Date",
    COUNT(c.order_id) AS "Number of Orders",
    SUM(c."Total Order Amount") AS "Total Amount Spent",
    AVG(c."Total Order Amount") AS "Average Order Amount"
FROM
    CustomerOrders c
GROUP BY
    c.customer_id
ORDER BY
    c.order_date DESC;
Displaying records 1 - 10
Customer ID Order Date Number of Orders Total Amount Spent Average Order Amount
383 2024-04-30 16:41:12 1 28.12 28.12
1013 2024-04-29 01:27:43 1 12.34 12.34
1436 2024-04-27 15:36:45 1 15.86 15.86
933 2024-04-27 00:04:23 1 15.67 15.67
975 2024-04-26 09:37:25 1 17.40 17.40
557 2024-04-24 21:49:03 1 1.16 1.16
1002 2024-04-24 18:23:20 1 26.49 26.49
1103 2024-04-23 05:28:36 1 8.94 8.94
1157 2024-04-23 02:15:30 1 52.09 52.09
110 2024-04-22 16:05:31 1 30.88 30.88

The part with the WITH inner query -

  • This CTE calculates the total amount spent on each order for every customer within the specified date range.

  • It joins the cust_order and order_line tables and groups the data by customer ID, order date, and order ID to compute the total order amount.

The rest, the outer query -

  • The main query selects from the CustomerOrders CTE.

  • It calculates the number of orders, total amount spent, and average order amount for each customer.

  • The results are grouped by customer ID and ordered by the order date in descending order.

When not using WITH()

SELECT 
    co.customer_id AS "Customer ID",
    co.order_date AS "Order Date",
    COUNT(orders_per_customer.order_id) AS "Number of Orders",
    SUM(orders_per_customer."Total Order Amount") AS "Total Amount Spent",
    AVG(orders_per_customer."Total Order Amount") AS "Average Order Amount"
FROM 
    cust_order co
INNER JOIN (
    SELECT 
        co_inner.customer_id, 
        co_inner.order_date,
        ol_inner.order_id,
        SUM(ol_inner.price) AS "Total Order Amount"
    FROM 
        cust_order co_inner
    INNER JOIN 
        order_line ol_inner ON co_inner.order_id = ol_inner.order_id
    WHERE 
        co_inner.order_date BETWEEN '2023-11-01' AND '2024-05-01'
    GROUP BY 
        co_inner.customer_id, co_inner.order_date, ol_inner.order_id
) AS orders_per_customer ON co.customer_id = orders_per_customer.customer_id AND co.order_date = orders_per_customer.order_date
WHERE 
    co.order_date BETWEEN '2023-11-01' AND '2024-05-01'
GROUP BY 
    co.customer_id, co.order_date
ORDER BY 
    co.order_date DESC;
Displaying records 1 - 10
Customer ID Order Date Number of Orders Total Amount Spent Average Order Amount
1240 2024-04-30 23:45:50 1 38.34 38.34
241 2024-04-30 23:18:12 1 17.15 17.15
155 2024-04-30 21:43:09 1 50.45 50.45
59 2024-04-30 19:32:57 1 16.54 16.54
79 2024-04-30 19:09:28 1 3.81 3.81
109 2024-04-30 19:01:49 1 29.51 29.51
1367 2024-04-30 18:50:07 1 31.52 31.52
21 2024-04-30 17:35:08 1 9.77 9.77
45 2024-04-30 17:34:56 1 9.68 9.68
383 2024-04-30 16:41:12 1 28.12 28.12

The part with the INNER JOIN inner query -

  • The inner query calculates the total amount spent on each order for every customer within the specified date range.

  • This inner query is essentially the same as the WITH part in the previous example.

The rest, the outer query -

  • The outer query joins the cust_order table with the result of the inner query (orders_per_customer).

  • It calculates the number of orders, total amount spent, and average order amount for each customer, grouped by customer ID and order date.

  • The results are ordered by the order date in descending order.

The advantages to using WITH()

  • Readability - Using the WITH clause (CTE) makes the query more readable.In the non-CTE version, the nested query can be harder to follow.

  • Maintainability - The CTE version is more maintainable because each part of the query is isolated. If you need to adjust the calculation logic, it’s clearer where to make changes.

  • Length - The non-CTE version tends to be more long-winded, as it requires embedding the subquery directly in the FROM clause, making the overall query longer and potentially more confusing.

7 - Further Data Manipulation in SQL

the ‘like’ operator

knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")

The LIKE operator is used for pattern matching within text fields. It allows the use of % to represent zero or more characters and _ to represent a single character. For example, the query SELECT * FROM customers WHERE name LIKE '%enko'; will return all rows where the name column ends with “enko”.

Getting all surnames starting with ‘Mac’ -

select * from customer where last_name like 'Mac%'
Displaying records 1 - 10
customer_id first_name last_name email
114 Aurelie MacSherry
211 Desiri MacDunleavy
299 Yvonne Maccree
416 Zaria MacCafferky
481 Kaleb MacClenan
500 Aurelia MacGowan
604 Agace MacKinnon
951 Wilma MacGorrie
979 Danell Mace
994 Archibold MacNab

Getting all surnames starting with ‘Mc’ -

select * from customer where last_name like 'Mc%'
Displaying records 1 - 10
customer_id first_name last_name email
73 Ruthi McGeever
168 Lynsey McPeice
187 Miner McLay
199 Bjorn McCloud
317 Stesha McAlees
319 Win McManamon
347 Ephrem Mc Ilwrick
410 Rriocard McPhail
539 Jane McCreath
772 Tybalt McOwen

Getting all surnames ending with ‘vich’ -

select * from customer where last_name like '%vich'
2 records
customer_id first_name last_name email
578 Lynea Matskevich
1545 Deana Matusevich

Getting all surnames ending with ‘enko’ -

select * from customer where last_name like '%enko'
2 records
customer_id first_name last_name email
174 Zachery Hriinchenko
1062 Harold Izacenko

Getting all surnames starting or ending with ‘man’ (lower or upper case) -

select * from customer where last_name like '%man%'
Displaying records 1 - 10
customer_id first_name last_name email
218 Abbot Tesseyman
221 Parker Strangman
319 Win McManamon
360 Dulci Portman
366 Giulia Borrowman
402 Charlean Palphramand
413 Sigfried Mansel
430 Adrian Kunzelmann
483 Antonie Liebmann
501 Salome Elliman

Making a new column specifically based on the like condition -


SELECT 
    last_name as "Customer Surname", 
    CASE 
        WHEN last_name LIKE '%enko' 
            OR last_name LIKE '%vich' 
            OR last_name LIKE '%vych' 
            OR last_name LIKE '%chuk' 
            OR last_name LIKE '%chyk' 
            OR last_name LIKE '%ski' 
            OR last_name LIKE '%sky' 
            OR last_name LIKE '%uk' 
            OR last_name LIKE '%ko' 
            OR last_name LIKE '%yshyn' 
            OR last_name LIKE '%iv' THEN 'Yes' 
        ELSE 'No' 
    END AS 'Potentially Ukrainian Surname?' 
FROM customer 
WHERE last_name LIKE '%enko' 
    OR last_name LIKE '%vich' 
    OR last_name LIKE '%vych' 
    OR last_name LIKE '%chuk' 
    OR last_name LIKE '%chyk' 
    OR last_name LIKE '%ski' 
    OR last_name LIKE '%sky' 
    OR last_name LIKE '%uk' 
    OR last_name LIKE '%ko' 
    OR last_name LIKE '%yshyn' 
    OR last_name LIKE '%iv';
Displaying records 1 - 10
Customer Surname Potentially Ukrainian Surname?
Hriinchenko Yes
Davidofski Yes
Bagniuk Yes
Georgievski Yes
Matskevich Yes
Malinowski Yes
Matschuk Yes
Izacenko Yes
Dmiterko Yes
Matusevich Yes

Unfortunately, SQLite that’s being used here does not support the command ‘similar to’ that could be used in PostGreSQL.

If PostGreSQL was being used, the above query could be shortened a lot -

SELECT last_name, CASE WHEN last_name SIMILAR TO ‘%(enko|vich|vych|chuk|chyk|ski|sky|uk|ko|yshyn|iv)’ THEN ‘YES’ ELSE ‘NO’ END AS “Ukrainian Surname” FROM customer WHERE last_name SIMILAR TO ‘%(enko|vich|vych|chuk|chyk|ski|sky|uk|ko|yshyn|iv)’;

SOUNDEX

knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")

SOUNDEX is a phonetic algorithm that indexes words by their sound when pronounced in English. This can be useful for matching words that sound similar but are spelled differently. SQLite has this function.


SELECT SOUNDEX('George');
1 records
SOUNDEX(‘George’)
G620

SELECT last_name AS 'Surnames' 
FROM customer
WHERE SOUNDEX(last_name) = SOUNDEX('Smyth');
2 records
Surnames
Smoth
Sineath

select soundex('Postgres'), soundex('Postgresss'), ('Postgres' = 'Postgresss'),
soundex('Postgres') = soundex('Postgresss')
1 records
soundex(‘Postgres’) soundex(‘Postgresss’) (‘Postgres’ = ‘Postgresss’) soundex(‘Postgres’) = soundex(‘Postgresss’)
P232 P232 0 1

A SOUNDEX code consists of a letter followed by three digits, representing the phonetic pattern of the word. Here’s how it is constructed:

  • First Letter: The first letter of the word is kept as the first letter of the SOUNDEX code.

  • Digits: The remaining letters are converted to digits based on their phonetic sound:

B, F, P, V → 1

C, G, J, K, Q, S, X, Z → 2

D, T → 3

L → 4

M, N → 5

R → 6

  • Similar Sounds: Adjacent letters that represent the same sound are collapsed into a single digit.

  • Vowels and Certain Letters: A, E, I, O, U, H, W, and Y are ignored unless they are the first letter.

  • Truncation/Zero Padding: The code is truncated to four characters if necessary, or zero-padded to ensure it is four characters long.

  • For example, the SOUNDEX code “P232” is generated from the word “Postgres” as follows:

    ‘P’ is the first letter.

    ‘o’ is ignored.

    ‘s’ maps to 2.

    ‘t’ maps to 3.

    ‘g’ maps to 2.

    Remaining letters (‘r’, ‘e’, ‘s’) are either ignored or do not change the pattern as the code is already four characters long.

Thus, “Postgres” becomes “P232”.


select difference ('Postgres', 'Postgresss') as "Difference between the strings Postgres and Postgresss"
1 records
Difference between the strings Postgres and Postgresss
4

The DIFFERENCE() function in SQL compares the SOUNDEX values of two strings and returns an integer value between 0 and 4, indicating the degree of similarity between the two strings. A result of 4 means the strings sound very similar, while a result of 0 means they sound very different. This function is particularly useful for fuzzy matching in text searches.

The levenshtein() function calculates the Levenshtein distance between two strings, which is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into the other. It is commonly used to measure the similarity between two strings, with a lower distance indicating greater similarity.

SQLite doesn’t have it unfortunately!

8 - Further Filtering & Aggregation in SQL

knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")

The HAVING Clause

SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
ORDER BY 
    "Sum of Orders per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Full Name
711.41 Rich Kirsz
663.41 Farand Tremmil
635.58 Renado Sherrington
622.31 Lynda Roseborough
611.59 Mick Sever
596.18 Alysa Crombleholme
583.81 Emylee Hubbert
574.20 La verne Figg
561.26 Zora Hurles
549.63 Penny Bovingdon

The HAVING clause is used to specify a condition for groups of rows created by the GROUP BY clause, similar to how the WHERE clause is used to specify a condition for individual rows.

SELECT 
    SUM(ol.price) AS "Sum of Orders per Customer",
    CONCAT(c.first_name, ' ', c.last_name) AS "Full Name"
FROM 
    order_line ol
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN 
    customer c ON co.customer_id = c.customer_id
GROUP BY 
    c.first_name, c.last_name
HAVING
    "Sum of Orders per Customer" > 100
ORDER BY 
    "Sum of Orders per Customer" DESC;
Displaying records 1 - 10
Sum of Orders per Customer Full Name
711.41 Rich Kirsz
663.41 Farand Tremmil
635.58 Renado Sherrington
622.31 Lynda Roseborough
611.59 Mick Sever
596.18 Alysa Crombleholme
583.81 Emylee Hubbert
574.20 La verne Figg
561.26 Zora Hurles
549.63 Penny Bovingdon

CUBE

knitr::include_graphics("C:/Users/gam55/Downloads/gravity_bookstore_erd.png")

SQLite does not support the CUBE operation directly. The CUBE operation, found in SQL databases like PostgreSQL, SQL Server, and Oracle, is used for generating a result set that represents a multi-dimensional cube for aggregation purposes.

It generates a result set that represents all possible combinations of grouping columns, providing comprehensive aggregate data for every combination.

When you use CUBE, SQL automatically performs aggregations for each combination of the specified dimensions (columns).

However, you can manually create similar results in SQLite using a combination of GROUP BY queries and UNION operations to simulate a CUBE.


SELECT 
    b.title AS "Book Title",
    a.city AS "Order Destination City",
    c.country_name AS "Order Destination Country", 
    ol.price AS "Price"
FROM 
    book b
INNER JOIN 
    order_line ol ON ol.book_id = b.book_id
INNER JOIN 
    cust_order co ON ol.order_id  = co.order_id
INNER JOIN
    address a ON co.dest_address_id = a.address_id
INNER JOIN
    country c ON a.country_id = c.country_id
;
Displaying records 1 - 10
Book Title Order Destination City Order Destination Country Price
The World’s First Love: Mary Mother of God Rameshki Russia 10.23
The Illuminati Youwarou Mali 18.14
The Illuminati Pagaden Indonesia 16.78
The Servant Leader Trongsa Bhutan 7.42
Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs Badong Indonesia 18.76
Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs Krajan Sale Indonesia 1.58
Cliffs Notes on Aristophanes’ Lysistrata The Birds The Clouds The Frogs Liwu China 10.03
Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II Gaoshan China 13.72
Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II Chengkou China 12.25
Life Is a Dream and Other Spanish Classics (Eric Bentley’s Dramatic Repertoire) - Volume II Ploso Wetan Indonesia 1.48

To replicate CUBE with the above query, the following steps are done in a query below:

  • Detailed Grouping - Group by Book Title, Order Destination City, and Order Destination Country.

  • Partial Groupings - Group by each pair of dimensions and each individual dimension. Book Title and Order Destination City, Book Title and Order Destination Country, Order Destination City and Order Destination Country, Book Title only, Order Destination City only, Order Destination Country only.

Grand Total: Aggregate without any grouping for the overall total.

-- Group by all three dimensions
SELECT 
    b.title AS "Book Title",
    a.city AS "Order Destination City",
    c.country_name AS "Order Destination Country", 
    SUM(ol.price) AS "Total Price"
FROM 
    book b
INNER JOIN 
    order_line ol ON ol.book_id = b.book_id
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    address a ON co.dest_address_id = a.address_id
INNER JOIN
    country c ON a.country_id = c.country_id
GROUP BY
    b.title, a.city, c.country_name

UNION ALL

-- Group by Book Title and Order Destination City
SELECT 
    b.title AS "Book Title",
    a.city AS "Order Destination City",
    NULL AS "Order Destination Country", 
    SUM(ol.price) AS "Total Price"
FROM 
    book b
INNER JOIN 
    order_line ol ON ol.book_id = b.book_id
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    address a ON co.dest_address_id = a.address_id
GROUP BY
    b.title, a.city

UNION ALL

-- Group by Book Title and Order Destination Country
SELECT 
    b.title AS "Book Title",
    NULL AS "Order Destination City",
    c.country_name AS "Order Destination Country", 
    SUM(ol.price) AS "Total Price"
FROM 
    book b
INNER JOIN 
    order_line ol ON ol.book_id = b.book_id
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    address a ON co.dest_address_id = a.address_id
INNER JOIN
    country c ON a.country_id = c.country_id
GROUP BY
    b.title, c.country_name

UNION ALL

-- Group by Order Destination City and Order Destination Country
SELECT 
    NULL AS "Book Title",
    a.city AS "Order Destination City",
    c.country_name AS "Order Destination Country", 
    SUM(ol.price) AS "Total Price"
FROM 
    book b
INNER JOIN 
    order_line ol ON ol.book_id = b.book_id
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    address a ON co.dest_address_id = a.address_id
INNER JOIN
    country c ON a.country_id = c.country_id
GROUP BY
    a.city, c.country_name

UNION ALL

-- Group by Book Title only
SELECT 
    b.title AS "Book Title",
    NULL AS "Order Destination City",
    NULL AS "Order Destination Country", 
    SUM(ol.price) AS "Total Price"
FROM 
    book b
INNER JOIN 
    order_line ol ON ol.book_id = b.book_id
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    address a ON co.dest_address_id = a.address_id
INNER JOIN
    country c ON a.country_id = c.country_id
GROUP BY
    b.title

UNION ALL

-- Group by Order Destination City only
SELECT 
    NULL AS "Book Title",
    a.city AS "Order Destination City",
    NULL AS "Order Destination Country", 
    SUM(ol.price) AS "Total Price"
FROM 
    book b
INNER JOIN 
    order_line ol ON ol.book_id = b.book_id
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    address a ON co.dest_address_id = a.address_id
INNER JOIN
    country c ON a.country_id = c.country_id
GROUP BY
    a.city

UNION ALL

-- Group by Order Destination Country only
SELECT 
    NULL AS "Book Title",
    NULL AS "Order Destination City",
    c.country_name AS "Order Destination Country", 
    SUM(ol.price) AS "Total Price"
FROM 
    book b
INNER JOIN 
    order_line ol ON ol.book_id = b.book_id
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    address a ON co.dest_address_id = a.address_id
INNER JOIN
    country c ON a.country_id = c.country_id
GROUP BY
    c.country_name

UNION ALL

-- Grand total
SELECT 
    NULL AS "Book Title",
    NULL AS "Order Destination City",
    NULL AS "Order Destination Country", 
    SUM(ol.price) AS "Total Price"
FROM 
    book b
INNER JOIN 
    order_line ol ON ol.book_id = b.book_id
INNER JOIN 
    cust_order co ON ol.order_id = co.order_id
INNER JOIN
    address a ON co.dest_address_id = a.address_id
INNER JOIN
    country c ON a.country_id = c.country_id;
Displaying records 1 - 10
Book Title Order Destination City Order Destination Country Total Price
$30 Film School: How to Write Direct Produce Shoot Edit Distribute Tour With and Sell Your Own No-Budget Digital Movie Filimonovo Russia 7.93
$30 Film School: How to Write Direct Produce Shoot Edit Distribute Tour With and Sell Your Own No-Budget Digital Movie Oehala Indonesia 16.83
$30 Film School: How to Write Direct Produce Shoot Edit Distribute Tour With and Sell Your Own No-Budget Digital Movie Oguma Nigeria 17.68
$30 Film School: How to Write Direct Produce Shoot Edit Distribute Tour With and Sell Your Own No-Budget Digital Movie San José Poaquil Guatemala 7.31
10 lb Penalty Xinshancun China 7.22
100 Great Fantasy Short Short Stories Rodotópi Greece 2.24
100 Great Fantasy Short Short Stories Villa Bustos Argentina 5.08
100 Years of Lynchings Taloqan Afghanistan 0.03
100 Years of Lynchings Wąbrzeźno Poland 19.77
1000 Record Covers Bulakbanjar Indonesia 10.83

If CUBE was available, much shorter code could be employed -

SELECT b.title AS “Book Title”, a.city AS “Order Destination City”, c.country_name AS “Order Destination Country”, SUM(ol.price) AS “Total Price” FROM book b INNER JOIN order_line ol ON ol.book_id = b.book_id INNER JOIN cust_order co ON ol.order_id = co.order_id INNER JOIN address a ON co.dest_address_id = a.address_id INNER JOIN country c ON a.country_id = c.country_id GROUP BY CUBE(b.title, a.city, c.country_name);

Why use CUBE, what’s the point?

Using the CUBE operator in SQL is valuable for data analysis.

Comprehensive Aggregation:
    CUBE provides a complete set of aggregations across all combinations of specified dimensions. This includes subtotals for every combination of the dimensions and a grand total.

Simplifies Query Writing:
    Instead of writing multiple GROUP BY queries with UNION ALL, a single query with CUBE handles all required groupings and aggregations.

Facilitates Data Exploration:
    By generating all possible subtotals, CUBE allows for easy exploration of data across different levels of granularity.

Supports OLAP Operations:
    CUBE is particularly useful in Online Analytical Processing (OLAP) systems, enabling complex analytical queries on data warehouses.

Use Cases and Next Steps After Using CUBE

Data Reporting:
    Use the result set from a CUBE operation to create comprehensive reports that show detailed and aggregated information. This is useful for business intelligence tools and dashboards.

Pivot Tables:
    The result can be used to create pivot tables in tools like Excel or Tableau, allowing users to interactively explore data across different dimensions.

Trend Analysis:
    Analyze trends and patterns by examining the subtotals and grand totals provided by the CUBE output.

Performance Metrics:
    Calculate key performance indicators (KPIs) at various levels of detail, providing insights into different aspects of the business or operation.

Anomaly Detection:
    Identify anomalies or outliers by comparing aggregated data at different levels of detail.